Status: Draft
Jan 28, 2021
The disparities in the COVID-19 pandemic along racial and ethnic lines have exposed longstanding health inequities in the U.S., as made clear by multiple analyses of cases and deaths (CRDT, NYT, APM, KFF, NPR). However, these analyses were all based on incomplete data due to the fragmented data landscape for race/ethnicity breakdowns, which has largely been left to non-governmental organizations collecting data from individual state public health websites. For case data in particular, as opposed to deaths data, the CDC has only published public race/ethnicity data for cases at the U.S. level, not the state or county levels. Indeed, ASPE, an agency within HHS, wrote in Oct 2020 that "The volunteer-based COVID tracking project has created the most comprehensive centralized resource for race and ethnicity data at the state level."
In July 2020, the New York Times (NYT) published The Fullest Look Yet at the Racial Inequity of Coronavirus, a one-time analysis of data from the CDC obtained via FOIA and legal action that contained county-level case data with race/ethnicity up to May 28, 2020. While several non-governmental organizations have taken it upon themselves to gather data for total case counts at the county level (NYT, JHU, USAFacts), none of them have collected race/ethnicity data, which would be a huge undertaking due to the non-uniformity of race/ethnicity categories in state and local public health websites.
In Nov 2020, the CDC made some of the case data that the NYT obtained public: county-level totals in a dashboard and public data about race/ethnicity with additional dashboards, but without state and county details. They also released restricted access data with race/ethnicity, state, and county available upon request. The CDC's initial restricted access data agreement did not allow for county-level analyses to be made public, but an updated data agreement from Dec 14, 2020 allowed such public analyses. In Jan 2021, the Morehouse School of Medicine's Satcher Health Leadership Institute (MSM/SHLI), in collaboration with Citizens for Responsibility and Ethics in Washington (CREW) and Google.org, applied for and got access to this data.
The CDC Restricted Access data enabled us to complete the first public analysis of race/ethnicity disparities across the U.S. at the county level since the NYT analysis in July. However, the underlying data has significant completeness issues; e.g., only 80% of total cases are included and only 54% of cases have known race/ethnicity and county. In the table below, which compares the CDC/NYT data to the CDC data and totals from the Covid Tracking Project (CTP), we can see that some of the completeness measures did improve since the NYT obtained the data.
Sources: NYT article and The Daily podcast episode about the article.
The goal of this analysis is to assess the completeness of the CDC's Restricted Access data and its feasibility in examining disparities in race/ethnicity for COVID-19 cases at the county level. We will first assess the completeness of the data on its own by looking at which fields are viable for analysis. We will next compare the total case counts in the restricted access data to two comparable public datasets at the state and county levels. We will also compare the cases with known race/ethnicity at the state level to the Covid Racial Data Tracker (CRDT) data.
The top-level data completeness findings are:
After examining the completeness of the data, we will finally examine race/ethnicity disparities at the county level for data up to Dec 16.
Note that we will not analyze any data about deaths or hospitalizations. While there are fields in the CDC data that indicate if the person died or was hospitalized, they are missing too many values to be reliable. There are also alternate data sources, such as the CDC Provisional Deaths data, which are based on different underlying data that may be more complete than the case data we are looking at here.
The restricted access data contains 32 fields, which are described on the CDC website. The public version of the restricted access data contains 12 of those fields. The data comes from this case report form, a dense two-page form that collects information about each lab-confirmed or probable COVID-19 case. The CDC has extensive FAQs about this surveillance data, one of which is about completeness:
How complete are the data that the CDC receives about COVID-19 cases?
The COVID-19 pandemic has put unprecedented demands on the public health data supply chain. In many states, the large number of COVID-19 cases has severely strained the ability of hospitals, healthcare providers, and laboratories to report cases with complete demographic information, such as race and ethnicity. The unprecedented volume of cases has also limited the ability of state and local health departments to conduct thorough case investigations and collect all requested case data.
As a result, many COVID-19 case notifications submitted to CDC do not have complete information on patient demographics; signs and symptoms of illness; underlying health conditions; characteristics of hospitalizations such as ventilator use; clinical outcomes; exposures; and factors that may put people at higher risk for severe disease. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Because of missing data, analyses of these data elements are likely an underestimate of the true occurrence.
Based on our analysis of the CDC data up to Dec 16, 2020, the only fields that are available for more than 50% of the cases are the date that the case was first reported to the CDC, the status of the case (lab-confirmed or probable), state, county, sex, age, and race/ethnicity, which are shown in the chart below. All other fields, including whether the person died or was hospitalized, are known for fewer than 50% of the cases.
Race/ethnicity is known for only 55% of cases, while the other fields above are known for 97%-99% of cases. The 45% of cases without known race/ethnicity were either marked as "Unknown" on the case report form (35%), missing due to being left blank on the form (4%), or suppressed for privacy reasons for small geographic and/or demographic population groups (2%).
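As a minimal sketch of how a per-field completeness percentage like the ones above could be computed, the snippet below counts the share of case records whose value for a field is present and informative. The field names, the set of "not known" labels, and the sample rows are all assumptions made for illustration; they are not the CDC's actual schema or data.

```python
# Labels treated as "not known". These are illustrative assumptions,
# not the CDC's actual coding scheme.
NOT_KNOWN = {"", "Unknown", "Missing", "NA"}


def pct_known(records, field):
    """Percent of records whose value for `field` is present and informative."""
    if not records:
        return 0.0
    known = 0
    for record in records:
        value = record.get(field)
        if value is not None and str(value).strip() not in NOT_KNOWN:
            known += 1
    return 100.0 * known / len(records)


# Made-up rows for illustration only (hypothetical field names).
cases = [
    {"sex": "Female", "race_ethnicity": "Unknown"},
    {"sex": "Male", "race_ethnicity": "Hispanic/Latino"},
    {"sex": "Female", "race_ethnicity": ""},
    {"sex": "Male", "race_ethnicity": "NA"},
]

completeness = {f: pct_known(cases, f) for f in ("sex", "race_ethnicity")}
print(completeness)
```

Treating "Unknown", blank, and suppressed values as the same "not known" bucket mirrors how the percentages above are reported, but a real analysis would likely want to keep those reasons separate.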
The CDC discussed the incompleteness of race/ethnicity data in their case data FAQs:
Most states have demographic factors like age and sex for most reported cases. However, in many states, the large number of COVID-19 cases has severely strained the ability to report cases with complete demographic information for race and ethnicity. With thousands of cases being reported, completeness of these elements is unlikely to improve in the immediate future for some jurisdictions.
The remaining fields, including whether the person died or was hospitalized, are all known for fewer than 50% of cases.
The CDC also commented on these fields in their case data FAQs:
Because of the volume of cases, most health departments are unable to conduct investigations of every case to obtain additional information. Because of this, most case reports are missing data on patient demographics, symptoms, underlying health conditions, characteristics of hospitalizations such as ventilator use, and other factors such as recent travel history.
The case report form contains many more fields, but unfortunately, the data gets less complete as you go down the form. CREW obtained a version of this data via FOIA that contains 101 fields with data up to Aug 25, 2020. Several of the additional fields from that dataset are shown below; the field with the most known data is whether the case was associated with an outbreak, but even that is only known for 30% of cases.
The first step to evaluating the completeness of the CDC data is to check the total case counts at the U.S., state, and county levels against known accurate data sources that aggregate state and local public health websites. The CDC case data FAQs say that we should not expect case data to always match the more accurate aggregate data, but that's a tradeoff we must make to get more detailed demographic information:
Aggregate counts provide the most up-to-date validated numbers on cases and deaths.
CDC receives the line-level data primarily from state health departments without personal identifiers such as names or home addresses. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Although CDC receives this information for most cases, it does not receive it for all cases.
Many public health websites do contain race/ethnicity details, but they do not all use the same standard race/ethnicity categories (CRDT analysis). So, we must sacrifice accuracy and timeliness to get standardized race/ethnicity data on cases across states and counties.
We will compare the CDC data against two sources of aggregate data: the Covid Racial Data Tracker (CRDT) and the NYT's public data, which are both updated on a regular basis (CRDT twice a week, NYT daily) and come from state and local public health websites. CRDT is the only source for case data with race/ethnicity breakdowns, but there are several sources for county-level aggregate case data in addition to the NYT, such as JHU and USAFacts (this paper analyzes the differences between those sources at the state level up to July for cases and deaths).
The table below compares geographic vs. race/ethnicity availability for these three different data sources:
Because the CDC is the only data source that has race/ethnicity at the county level, the most similar data for purposes of comparison are (1) CRDT data at the state level with race/ethnicity, and (2) NYT data at the county level with no race/ethnicity.
We will compare across these data sources up to Dec 16, 2020, which is the latest reporting date in the CDC data. We expect to see some variation in the case counts due to lags in reporting the data, but we don't expect that time lags can explain large percentages of missing cases.
To get a baseline of how much we could expect the CDC case counts to match the CRDT or NYT, we can see how closely the CRDT and NYT match each other. Each dot below is a state (hover to see details), and the black line shows where the NYT and CRDT case counts are equal.
The ratio of NYT to CRDT cases is between 0.97 and 1.11 for all states:
We can also view these ratios on a map (hover over states for details).
We can see below that the CDC case counts differ from the CRDT case counts much more drastically than the NYT case counts did.
The ratio of CDC to CRDT cases is between 0.03 and 1.64 for the 50 states plus D.C.:
In other words, 63% of states in the CDC data are within +/-15% of the CRDT case counts and 71% of states are within +/-50% of those counts.
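The ratio and "within +/-X%" figures above can be sketched roughly as follows. We assume here that "within +/-15%" means a CDC-to-reference ratio between 0.85 and 1.15; the function names and the counts are hypothetical and exist only to illustrate the calculation.

```python
def ratios(cdc_counts, reference_counts):
    """CDC-to-reference case ratio for each region present in both
    sources with a nonzero reference count."""
    return {
        region: cdc_counts[region] / ref
        for region, ref in reference_counts.items()
        if region in cdc_counts and ref > 0
    }


def share_within(ratio_by_region, tolerance):
    """Fraction of regions whose ratio is within +/- tolerance of 1.0."""
    if not ratio_by_region:
        return 0.0
    hits = sum(1 for r in ratio_by_region.values() if abs(r - 1.0) <= tolerance)
    return hits / len(ratio_by_region)


# Made-up counts for four hypothetical states (not real data).
cdc = {"A": 950, "B": 400, "C": 1300, "D": 30}
crdt = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}

state_ratios = ratios(cdc, crdt)
within_15 = share_within(state_ratios, 0.15)
within_50 = share_within(state_ratios, 0.50)
print(state_ratios, within_15, within_50)
```

The same two functions would apply unchanged to a county-level comparison, with county identifiers as the keys instead of states.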
Here are the ratios shown on a map:
The 32 states that were within +/-15% of the CRDT data could plausibly be off due to time lags in reporting cases to the CDC vs. reporting them on state public health websites, but there are many outlier states that are too far off from the CRDT case counts to be explained by a time lag:
We can do the same analysis at the county level using the CDC vs. NYT data.
Each dot is a county (hover to see details). We show all counties on the left and zoom in on the smaller counties on the right.
The ratio of CDC to NYT cases is between 0.00 and 9.80 for the 3,045 counties in the CDC data:
In other words, 48% of counties in the CDC data are within +/-15% of the NYT case counts and 70% are within +/-50% of those counts.
We can also view these ratios on a map. Note that the legend only goes to 2.0, and all counties with a larger ratio are shown in the same dark blue.
We can now look at the completeness of the CDC data for race/ethnicity on its own and in comparison to the CRDT data at the state level.
We can first compare the number of cases with known race/ethnicity within each state between the CDC and CRDT data.
The ratio of CDC to CRDT cases with known race/ethnicity is between 0.01 and 1.18 for all states excluding New York, which has 0 cases with known race/ethnicity in CRDT.
Only 4 states (Massachusetts, Minnesota, Utah, Washington) had more cases with known race/ethnicity in the CDC data than in the CRDT data, whereas 23 states had more total cases in the CDC data than in the CRDT data. We can again view this comparison on a map:
What accounts for the differences between the CDC and CRDT for the number of cases with known race/ethnicity?
Comparing the number of cases with known race/ethnicity combines these two factors into one. We can separate the factors by comparing the total case counts by state, which we already did above, and separately comparing the percentage of cases with known race/ethnicity by state.
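One way to see why the comparison separates cleanly: the number of cases with known race/ethnicity is the total case count times the percentage known, so the ratio of known cases factors into the ratio of totals times the ratio of percentages known. The sketch below checks that identity on made-up numbers; the function name and inputs are hypothetical.

```python
def decompose_known_ratio(cdc_total, cdc_known, crdt_total, crdt_known):
    """Split the CDC/CRDT ratio of known-race/ethnicity cases into
    (ratio of total cases, ratio of percent-known)."""
    total_ratio = cdc_total / crdt_total
    pct_known_ratio = (cdc_known / cdc_total) / (crdt_known / crdt_total)
    return total_ratio, pct_known_ratio


# Made-up counts for one hypothetical state (not real data).
cdc_total, cdc_known = 800, 400
crdt_total, crdt_known = 1000, 900

total_ratio, pct_known_ratio = decompose_known_ratio(
    cdc_total, cdc_known, crdt_total, crdt_known
)
known_ratio = cdc_known / crdt_known

# The product of the two factors reproduces the combined ratio.
print(total_ratio, pct_known_ratio, total_ratio * pct_known_ratio, known_ratio)
```

In this example the state is missing 20% of its total cases and also has a lower known percentage, and both shortfalls multiply together in the combined ratio.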
When calculating disparities between different race/ethnicity groups, we will need to be cautious about drawing conclusions from data in states where there is race/ethnicity data for a small percentage of the population and/or the overall case totals are incomplete. For example, California only has 21% of cases with race/ethnicity, and 0% of the cases are Hispanic/Latino (96 out of 1.9M), which doesn't match state or local reporting.
We don't have a point of comparison for the known race/ethnicity percentage at the county level, as we do at the state level, but we can look at the percentage of cases with known race/ethnicity in the CDC data on its own to see the variation across states and counties.
We can now look at disparities between different race/ethnicity groups for the entire U.S.
In the chart below, "AIAN" stands for "American Indian / Alaska Native" and "NHPI" stands for "Native Hawaiian / Pacific Islander."
The CDC case data shows 4.09% of the U.S. population having had COVID-19, whereas the CRDT data shows 5.12% of the U.S. population having had COVID-19 up to Dec 16 (based on the CDC data only having 80% of the total cases in the CRDT data). Note that the Total group is larger than all of the other groups because it also includes the 45% of cases in the data that didn't have known race/ethnicity.
We can also look at the percent of each age and race/ethnicity group who had COVID-19.
We can see above that people age 20-29 are more likely to get COVID-19 than any other age group in almost all race/ethnicity groups. Because different race/ethnicity groups have different age compositions, splitting the cases into cases per age and race/ethnicity group allows us to compare race/ethnicity groups against each other without the different age compositions complicating the comparison.
We can also look at the age-adjusted case rates, which use a standard age composition across all race/ethnicity groups to weight the values within each age group. This allows us to compare the rate of COVID-19 within each race/ethnicity group and remove age composition differences as a factor from the comparison. The CDC has published age-adjusted prevalence data for deaths, but not for cases.
We can see below that, unlike for COVID-19 deaths, the crude and age-adjusted numbers are fairly similar within each race/ethnicity group except for Asian/NHPI, where the age-adjusted rate is 1.2 percentage points higher. Note that we combined Asian and NHPI into one category to calculate age-adjusted numbers due to the availability of Census/ACS data with those age and race/ethnicity breakdowns.
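For readers unfamiliar with age adjustment, here is a minimal sketch of direct standardization, which matches the description above of weighting each age group's rate by a standard age composition. The age bands, the standard population, and the counts are made up for illustration, and we do not claim this is the exact procedure or standard population used in this analysis.

```python
def crude_rate(cases_by_age, pop_by_age):
    """Total cases divided by total population for one group."""
    return sum(cases_by_age.values()) / sum(pop_by_age.values())


def age_adjusted_rate(cases_by_age, pop_by_age, standard_pop_by_age):
    """Direct standardization: weight each age-specific rate by that
    age band's share of the standard population."""
    standard_total = sum(standard_pop_by_age.values())
    return sum(
        (cases_by_age[age] / pop_by_age[age])
        * (standard_pop_by_age[age] / standard_total)
        for age in standard_pop_by_age
    )


# Hypothetical standard population and one hypothetical, younger-skewing group.
standard = {"0-29": 400, "30-59": 400, "60+": 200}
group_pop = {"0-29": 600, "30-59": 300, "60+": 100}
group_cases = {"0-29": 60, "30-59": 15, "60+": 2}

crude = crude_rate(group_cases, group_pop)
adjusted = age_adjusted_rate(group_cases, group_pop, standard)
print(round(crude, 4), round(adjusted, 4))
```

In this toy example the group is younger than the standard population and its youngest band has the highest rate, so the age-adjusted rate comes out lower than the crude rate; the direction of the adjustment depends entirely on how a group's age mix differs from the standard.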
We can now examine the race/ethnicity disparities at the county level. We first look at the percentage of people who had COVID-19 within each county. We show all counties with data even if they have a small percentage of cases with known race/ethnicity. Note that the legend only goes to 20%, so counties with a higher rate will be shown in the same dark brown color. You can hover over the counties in the map for more details.
Larger versions of these maps for hovering over smaller counties are available here.
We can also view disparities by comparing the percentage of total cases that a race/ethnicity group accounts for in a county (the cases share) vs. the percentage of the total population that a race/ethnicity group accounts for in a county (the population share). There is no disparity when the cases share is equal to the population share for all race/ethnicity groups in a county (ratio = 1.0). When the ratio of cases share to population share is above 1.0, then a group has a disproportionate number of cases relative to its share of the population.
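The cases-share-to-population-share ratio can be sketched as below for a single county. The group labels and counts are hypothetical, and the sketch assumes that the denominator for the cases share is the sum of cases across the groups supplied (that is, cases with known race/ethnicity), which is our assumption rather than something stated above.

```python
def disparity_ratios(cases_by_group, pop_by_group):
    """Ratio of cases share to population share for each group in a county.

    A ratio of 1.0 means the group's share of cases equals its share of
    the population; above 1.0 means a disproportionate share of cases.
    """
    total_cases = sum(cases_by_group.values())
    total_pop = sum(pop_by_group.values())
    result = {}
    for group, pop in pop_by_group.items():
        if pop == 0 or total_cases == 0:
            continue  # the ratio is undefined without population or cases
        cases_share = cases_by_group.get(group, 0) / total_cases
        pop_share = pop / total_pop
        result[group] = cases_share / pop_share
    return result


# Made-up county with two hypothetical groups (not real data).
county_cases = {"Group X": 30, "Group Y": 70}
county_pop = {"Group X": 200, "Group Y": 800}

county_ratios = disparity_ratios(county_cases, county_pop)
print(county_ratios)
```

Here Group X makes up 20% of the population but 30% of the cases, so its ratio is above 1.0, while Group Y's ratio falls below 1.0.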